{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "Competition link: "
   ]
  },
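  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background\n",
    "\n",
    "For mobile device manufacturers, demographic information about current phone users is very hard to obtain. Accurately predicting users' demographic attributes from their phones and their everyday app preferences is the foundation for personalized experiences and precise user profiling. Note that the data in this competition was collected with the full knowledge and consent of the individual users and has been suitably anonymized to protect privacy. For confidentiality reasons, we do not provide details on how the gender and age data were obtained.\n",
    "\n",
    "\n",
    "## Task\n",
    "\n",
    "The competition has two tasks: predicting the gender and the age of each mobile device (device_id). Gender prediction is a binary classification problem and age prediction is a regression problem; the scores of the two parts are combined for the final ranking.\n",
    "\n",
    "## Evaluation Rules\n",
    "\n",
    "1. Data description\n",
    "\n",
    "The data consists of a training set, a test set, and event data. There are more than 20,000 device IDs in total, with device information, app information, and event information. device_id is the unique identifier of a user, gender is the user's gender, and age is the user's age. To keep the competition fair, 20,000 device IDs are drawn as the training set and a little over 3,000 device IDs as the test set, and some fields are desensitized.\n",
    "\n",
    "\n",
    "<img src=\"./pic/data.png\" width = \"300\" height = \"150\" alt=\"data\" align=center />\n",
    " \n",
    "\n",
    "2. Evaluation metric\n",
    "\n",
    "The score has two parts: gender prediction is scored by accuracy and age prediction by 1/(MAE+1), so the maximum total score is 2. Reference evaluation code:"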
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score, mean_absolute_error\n",
    "# standard holds the ground-truth answers\n",
    "# submit holds the participant's submitted answers\n",
    "score1 = accuracy_score(standard['gender'], submit['gender'])\n",
    "score2 = 1 / (1 + mean_absolute_error(standard['age'], submit['age']))\n",
    "# final score\n",
    "final_score = score1 + score2"
   ]
  },
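  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scoring rule above can be checked by hand on a toy example. The sketch below is a minimal pure-Python restatement of the reference code, not the organizers' implementation; the function name `competition_score` and the sample values are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "def competition_score(gender_true, gender_pred, age_true, age_pred):\n",
    "    # Accuracy: fraction of devices whose gender is predicted correctly.\n",
    "    n = len(gender_true)\n",
    "    accuracy = sum(t == p for t, p in zip(gender_true, gender_pred)) / n\n",
    "    # Mean absolute error of the age predictions.\n",
    "    mae = sum(abs(t - p) for t, p in zip(age_true, age_pred)) / len(age_true)\n",
    "    # Accuracy is at most 1 and 1 / (1 + MAE) is at most 1, so the sum is at most 2.\n",
    "    return accuracy + 1 / (1 + mae)\n",
    "\n",
    "\n",
    "# Hypothetical toy data: 3 of 4 genders correct, age MAE = 1.25.\n",
    "score = competition_score([0, 1, 1, 0], [0, 1, 0, 0],\n",
    "                          [20, 30, 40, 50], [22, 30, 37, 50])\n",
    "print(round(score, 4))\n",
    "```\n",
    "\n",
    "On this toy input the accuracy is 0.75 and the MAE is 1.25, so the printed score is 1.1944. Because the age term decays as 1/(1+MAE), each extra year of error costs less as the MAE grows, while a one-point gain in accuracy always adds the same 0.01."
   ]
  },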
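  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "3. Evaluation and leaderboard\n",
    "\n",
    "(1) The competition provides data for download; participants develop and tune their algorithms locally and submit results on the competition page.\n",
    "\n",
    "(2) Each team may submit at most 3 times per day.\n",
    "\n",
    "(3) The leaderboard is sorted by score from high to low, using each team's best historical score.\n",
    "\n",
    "## Submission Requirements\n",
    "\n",
    "File format: the prediction file must be in CSV format\n",
    "\n",
    "File size: no limit\n",
    "\n",
    "Submission limit: at most 3 per team per week\n",
    "\n",
    "Details of the prediction file:\n",
    "\n",
    "1) Submit in CSV format, encoded as UTF-8, with the header in the first row;\n",
    "\n",
    "2) Before submitting, make sure the format of the predictions matches sample_submit.csv. The format is as follows:"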
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "    device_id,gender,age\n",
    "    20000,0,0\n",
    "    20001,0,0\n",
    "    20002,0,0\n",
    "    20003,0,0\n",
    "    20004,0,0"
   ]
  },
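  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The submission format described above can be produced with a few lines of Python. The sketch below uses only the standard library; the helper name `build_submission` and the predicted values are hypothetical, and real predictions would come from a trained model.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "import io\n",
    "\n",
    "\n",
    "def build_submission(rows):\n",
    "    # rows: iterable of (device_id, gender, age) tuples.\n",
    "    buffer = io.StringIO(newline='')\n",
    "    writer = csv.writer(buffer)\n",
    "    # The header must be the first row of the file.\n",
    "    writer.writerow(['device_id', 'gender', 'age'])\n",
    "    for device_id, gender, age in rows:\n",
    "        writer.writerow([device_id, gender, age])\n",
    "    return buffer.getvalue()\n",
    "\n",
    "\n",
    "# Hypothetical predictions for three test devices.\n",
    "text = build_submission([(20000, 0, 31), (20001, 1, 27), (20002, 0, 45)])\n",
    "for line in text.splitlines():\n",
    "    print(line)\n",
    "```\n",
    "\n",
    "With pandas, calling `to_csv('submit.csv', index=False, encoding='utf-8')` on a DataFrame with the same three columns writes an equivalent file; `index=False` matters because an extra index column would no longer match sample_submit.csv."
   ]
  },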
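  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Schedule\n",
    "\n",
    "Official competition\n",
    "\n",
    "August 16 to September 15\n",
    "\n",
    "The final preliminary-round result is each team's best score within the preliminary-round period (test rankings excluded).\n",
    "\n",
    "The submission deadline for the preliminary round is 17:00 on September 8; the official competition rankings are announced at 10:00 on September 9.\n",
    "\n",
    "Long-term competition\n",
    "\n",
    "September 16 to October 24\n",
    "\n",
    "Because the event is mainly for learning and practice, the official competition then turns into a long-term competition for developers to keep practicing. Submissions in this stage continue to update the leaderboard, but the leaderboard is no longer officially published and no prizes are awarded.\n",
    "\n",
    "## Prizes\n",
    "\n",
    "The competition awards one first prize, one second prize, and one third prize:\n",
    "\n",
    "First prize: 1 team, weekly-competition first prize certificate, prize money: 1,000 yuan\n",
    "\n",
    "Second prize: 1 team, weekly-competition second prize certificate, prize money: 800 yuan\n",
    "\n",
    "Third prize: 1 team, weekly-competition third prize certificate, prize money: 500 yuan"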
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Competition Baseline\n",
    "\n",
    "## Importing Common Packages & Reading the Data\n",
    "\n",
    "### Importing common packages\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T04:22:03.561360Z",
     "start_time": "2021-08-15T04:22:01.967251Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import os \n",
    "import gc\n",
    "import glob\n",
    "from joblib import Parallel, delayed\n",
    "import seaborn as sns\n",
    "from tqdm import tqdm\n",
    "from sklearn.model_selection import KFold \n",
    "import lightgbm as lgbm  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Reading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T04:23:03.834314Z",
     "start_time": "2021-08-15T04:22:57.023863Z"
    }
   },
   "outputs": [],
   "source": [
    "data_path = './data/' \n",
    "df_tr               = pd.read_csv(data_path + 'train.csv')\n",
    "df_tr_app_events    = pd.read_csv(data_path + 'train_app_events.csv')\n",
    "df_te               = pd.read_csv(data_path + 'test.csv')\n",
    "df_te_app_events    = pd.read_csv(data_path + 'test_app_events.csv') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_tr_te_app_events = pd.concat([df_tr_app_events,df_te_app_events],axis=0,ignore_index = True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T07:58:56.340600Z",
     "start_time": "2021-08-15T07:58:56.286466Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>event_id</th>\n",
       "      <th>app_id</th>\n",
       "      <th>is_installed</th>\n",
       "      <th>is_active</th>\n",
       "      <th>device_id</th>\n",
       "      <th>tag_list</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>14271</td>\n",
       "      <td>[549, 721, 704, 302, 303, 548, 183]</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>14271</td>\n",
       "      <td>[713, 704, 548]</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>14271</td>\n",
       "      <td>[549, 710, 704, 548, 172]</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>14271</td>\n",
       "      <td>[548, 549]</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>14271</td>\n",
       "      <td>[128, 1014]</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   event_id  app_id  is_installed  is_active  device_id  \\\n",
       "0         6       0             1          1      14271   \n",
       "1         6       1             1          1      14271   \n",
       "2         6       2             1          1      14271   \n",
       "3         6       3             1          1      14271   \n",
       "4         6       4             1          1      14271   \n",
       "\n",
       "                              tag_list  date  \n",
       "0  [549, 721, 704, 302, 303, 548, 183]     1  \n",
       "1                      [713, 704, 548]     1  \n",
       "2            [549, 710, 704, 548, 172]     1  \n",
       "3                           [548, 549]     1  \n",
       "4                          [128, 1014]     1  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_tr_te_app_events.head()"
   ]
  },
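  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `tag_list` column in the preview above looks like a Python list, but after `pd.read_csv` it is most likely plain text. The sketch below parses one value under the assumption that each entry is a string such as `[548, 549]`; the helper name `parse_tag_list` is hypothetical and the storage format should be checked against the real file first.\n",
    "\n",
    "```python\n",
    "import ast\n",
    "\n",
    "\n",
    "def parse_tag_list(raw):\n",
    "    # Assumption: tag_list is stored as text such as '[548, 549]'.\n",
    "    # literal_eval only evaluates Python literals, so it does not run arbitrary code.\n",
    "    tags = ast.literal_eval(raw)\n",
    "    return [int(t) for t in tags]\n",
    "\n",
    "\n",
    "# Value copied from the first row of the preview.\n",
    "tags = parse_tag_list('[549, 721, 704, 302, 303, 548, 183]')\n",
    "print(len(tags), tags[:3])\n",
    "```\n",
    "\n",
    "If the assumption holds, the helper could be applied to the whole column with `df_tr_te_app_events['tag_list'].apply(parse_tag_list)`."
   ]
  },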
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### gender\n",
    "\n",
    "- gender takes two distinct values, 0 and 1, and the two classes look fairly balanced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T07:54:58.957062Z",
     "start_time": "2021-08-15T07:54:58.954115Z"
    }
   },
   "outputs": [],
   "source": [
    "df_tr['gender'].value_counts().plot(kind = 'bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<img src=\"./pic/gender_distribution.png\" width = \"400\" height = \"150\" alt=\"gender_distribution\" align=center />\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### age\n",
    "\n",
    "- age is concentrated mainly between 20 and 40"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T07:54:03.950608Z",
     "start_time": "2021-08-15T07:54:03.947861Z"
    }
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize = [10,8])\n",
    "sns.histplot(df_tr['age'], kde=True, stat='density')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<img src=\"./pic/age_distribution.png\" width = \"400\" height = \"150\" alt=\"age_distribution\" align=center />\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### date\n",
    "\n",
    "- date is distributed fairly evenly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:01:37.438544Z",
     "start_time": "2021-08-15T08:01:37.187260Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAASeUlEQVR4nO3df4xdZ33n8fcHO0YkpGG3HgKNMzi7a9qmLInorKEb1DjbkjoprNWqreyiUiHSERXpTy1qdrsi2l2t1IqVVq0SalnFjdA2iZYWg6s1SdD+aFjSqLbTkNgpoa5Jm5EBhyQkDaAGw3f/uMfay+TO3GP7ztzx0/dLurrnPM9zzvlee+Yz5z5zzp1UFZKkdr1s2gVIklaWQS9JjTPoJalxBr0kNc6gl6TGGfSS1Lg1G/RJ9iY5meRIz/E/k+SxJEeT3LnS9UnS+SJr9Tr6JD8MvAB8pKreMGbsFuC/A/+qqp5N8uqqOrkadUrSWrdmz+ir6n7gmeG2JP80yT1JDif5dJLv67p+Abi9qp7ttjXkJamzZoN+CXuAX6qqHwT+DfChrv31wOuTfCbJg0m2T61CSVpj1k+7gL6SvBL4l8BHk5xufnn3vB7YAmwDNgGfTvKGqvrqKpcpSWvOeRP0DN59fLWqrh7RtwA8WFXfBL6Q5HEGwX9wFeuTpDXpvJm6qarnGYT4TwNk4Kqu++PAdV37RgZTOcenUackrTVrNuiT3AX8GfC9SRaSvAd4J/CeJJ8FjgI7uuH3Ak8neQz438D7q+rpadQtSWvNmr28UpI0GWv2jF6SNBkGvSQ1bk1edbNx48bavHnztMuQpPPG4cOHv1JVM6P61mTQb968mUOHDk27DEk6byT5m6X6nLqRpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNW5N3jDVx+Zb/sdE9vPEb/34RPYjSWuVZ/SS1Ljz9ox+LZrUuwzwnYakyTHoG+cPH0lO3UhS48ae0SfZC7wdOFlVbxjR/34Gf+Lv9P6+H5ipqmeSPAH8HfAt4FRVzU2qcElSP33O6O8Ati/VWVUfrKqrq+pq4N8Cf1pVzwwNua7rN+QlaQrGntFX1f1JNvfc3y7grnOqSM3z9wbS6prYHH2SCxmc+f/xUHMB9yU5nGR+zPbzSQ4lOfTUU09NqixJ+gdvkr+MfQfwmUXTNtdU1ZuAG4D3JfnhpTauqj1VNVdVczMzI/8aliTpLEwy6HeyaNqmqk50zyeBfcDWCR5PktTDRII+ySXAtcAnhtouSnLx6WXgeuDIJI4nSeqvz+WVdwHbgI1JFoBbgQsAqmp3N+wngPuq6mtDm14K7Ety+jh3VtU9kytdktRHn6tudvUYcweDyzCH244DV51tYZKkyfAjEKSOn4iqVhn00hrmDx9Ngp91I0mNM+glqXFO3Ug6I36ExfnHM3pJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1LixQZ9kb5KTSY4s0b8tyXNJHu4eHxjq257k8STHktwyycIlSf30OaO/A9g+Zsynq+rq7vEfAZKsA24HbgCuBHYlufJcipUknbmxQV9V9wPPnMW+twLHqup4Vb0I3A3sOIv9SJLOwaTm6H8oyWeTfDLJD3RtlwFPDo1Z6NokSatoEn9K8CHgdVX1QpIbgY8DW4CMGFtL7STJPDAPMDs7O4GyJEkwgTP6qnq+ql7olg8AFyTZyOAM/vKhoZuAE8vsZ09VzVXV3MzMzLmWJUnqnHPQJ3lNknTLW7t9Pg0cBLYkuSLJBmAnsP9cjydJOjNjp26S3AVsAzYmWQBuBS4AqKrdwE8Bv5jkFPANYGdVFXAqyc3AvcA6YG9VHV2RVyFJWtLYoK+qXWP6bwNuW6LvAHDg7EqTJE2Cd8ZKUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0
ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjRsb9En2JjmZ5MgS/e9M8kj3eCDJVUN9TyR5NMnDSQ5NsnBJUj99zujvALYv0/8F4NqqeiPwn4A9i/qvq6qrq2ru7EqUJJ2L9eMGVNX9STYv0//A0OqDwKYJ1CVJmpBJz9G/B/jk0HoB9yU5nGR+wseSJPUw9oy+ryTXMQj6tw41X1NVJ5K8GvhUks9V1f1LbD8PzAPMzs5OqixJ+gdvImf0Sd4I/D6wo6qePt1eVSe655PAPmDrUvuoqj1VNVdVczMzM5MoS5LEBII+ySzwMeDnqurzQ+0XJbn49DJwPTDyyh1J0soZO3WT5C5gG7AxyQJwK3ABQFXtBj4AfDfwoSQAp7orbC4F9nVt64E7q+qeFXgNkqRl9LnqZteY/puAm0a0HweueukWkqTV5J2xktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklq3NigT7I3yckkR5boT5LfTXIsySNJ3jTUtz3J413fLZMsXJLUT58z+juA7cv03wBs6R7zwO8BJFkH3N71XwnsSnLluRQrSTpzY4O+qu4HnllmyA7gIzXwIPCqJK8FtgLHqup4Vb0I3N2NlSStoknM0V8GPDm0vtC1LdUuSVpFkwj6jGirZdpH7ySZT3IoyaGnnnpqAmVJkmAyQb8AXD60vgk4sUz7SFW1p6rmqmpuZmZmAmVJkmAyQb8feFd39c1bgOeq6ovAQWBLkiuSbAB2dmMlSato/bgBSe4CtgEbkywAtwIXAFTVbuAAcCNwDPg68O6u71SSm4F7gXXA3qo6ugKvQZK0jLFBX1W7xvQX8L4l+g4w+EEgSZoS74yVpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGtcr6JNsT/J4kmNJbhnR//4kD3ePI0m+leQfd31PJHm06zs06RcgSVre2D8OnmQdcDvwNmABOJhkf1U9dnpMVX0Q+GA3/h3Ar1XVM0O7ua6qvjLRyiVJvfQ5o98KHKuq41X1InA3sGOZ8buAuyZRnCTp3PUJ+suAJ4fWF7q2l0hyIbAd+OOh5gLuS3I4yfzZFipJOjtjp26AjGirJca+A/jMommba6rqRJJXA59K8rmquv8lBxn8EJgHmJ2d7VGWJKmPPmf0C8DlQ+ubgBNLjN3JommbqjrRPZ8E9jGYCnqJqtpTVXNVNTczM9OjLElSH32C/iCwJckVSTYwCPP9iwcluQS4FvjEUNtFSS4+vQxcDxyZROGSpH7GTt1U1akkNwP3AuuAvVV1NMl7u/7d3dCfAO6rqq8NbX4psC/J6WPdWVX3TPIFSJKW12eOnqo6ABxY1LZ70fodwB2L2o4DV51ThZKkc+KdsZLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJalyvoE+yPcnjSY4luWVE/7YkzyV5uHt8oO+2kqSVtX7cgCTrgNuBtwELwMEk+6vqsUVDP11Vbz/LbSVJK6TPGf1W4FhVHa+qF4G7gR09938u20qSJqBP0F8GPDm0vtC1LfZDST6b5JNJfuAMt5UkrZCxUzdARrTVovWHgNdV1QtJbgQ+Dmzpue3gIMk8MA8wOzvboyxJUh99zugXgMuH1jcBJ4YHVNXzVfVCt3wAuCDJxj7bDu1jT1XNVdXczMzMGbwESdJy+gT9QWBLkiuSbAB2AvuHByR5TZJ0y1u7/T7dZ1tJ0soa
O3VTVaeS3AzcC6wD9lbV0STv7fp3Az8F/GKSU8A3gJ1VVcDIbVfotUiSRugzR396OubAorbdQ8u3Abf13VaStHq8M1aSGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqXK+gT7I9yeNJjiW5ZUT/O5M80j0eSHLVUN8TSR5N8nCSQ5MsXpI03tg/Dp5kHXA78DZgATiYZH9VPTY07AvAtVX1bJIbgD3Am4f6r6uqr0ywbklST33O6LcCx6rqeFW9CNwN7BgeUFUPVNWz3eqDwKbJlilJOlt9gv4y4Mmh9YWubSnvAT45tF7AfUkOJ5k/8xIlSedi7NQNkBFtNXJgch2DoH/rUPM1VXUiyauBTyX5XFXdP2LbeWAeYHZ2tkdZkqQ++pzRLwCXD61vAk4sHpTkjcDvAzuq6unT7VV1ons+CexjMBX0ElW1p6rmqmpuZmam/yuQJC2rT9AfBLYkuSLJBmAnsH94QJJZ4GPAz1XV54faL0py8ell4HrgyKSKlySNN3bqpqpOJbkZuBdYB+ytqqNJ3tv17wY+AHw38KEkAKeqag64FNjXta0H7qyqe1bklUiSRuozR09VHQAOLGrbPbR8E3DTiO2OA1ctbpckrR7vjJWkxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuN6BX2S7UkeT3IsyS0j+pPkd7v+R5K8qe+2kqSVNTbok6wDbgduAK4EdiW5ctGwG4At3WMe+L0z2FaStIL6nNFvBY5V1fGqehG4G9ixaMwO4CM18CDwqiSv7bmtJGkFre8x5jLgyaH1BeDNPcZc1nNbAJLMM3g3APBCksd71DbORuAryw3Ib0/gKGdmbE2wNuuyJsCa+jpvv86nYFI1vW6pjj5BnxFt1XNMn20HjVV7gD096uktyaGqmpvkPs/VWqwJ1mZd1tSPNfW3FutajZr6BP0CcPnQ+ibgRM8xG3psK0laQX3m6A8CW5JckWQDsBPYv2jMfuBd3dU3bwGeq6ov9txWkrSCxp7RV9WpJDcD9wLrgL1VdTTJe7v+3cAB4EbgGPB14N3Lbbsir2S0iU4FTcharAnWZl3W1I819bcW61rxmlI1cspcktQI74yVpMYZ9JLUOINekhrXTNAneXOS7+qWX5HkPyT5kyS/neSSadcHkOStSX49yfVTrmNrkn/RLV/Z1XTjNGtaLMlH1kANv5zk8vEjV1eS70vyI0leuah9+7RqWou6f6ff6D6H63e65e+fYj0bkrwryY926z+b5LYk70tywYoeu5VfxiY5ClzVXemzh8HVP38E/EjX/pNTqOnPq2prt/wLwPuAfcD1wJ9U1W9NoaZbGXz20HrgUwzuVP4/wI8C91bVf55CTYsvuQ1wHfC/AKrqX692TQBJngO+Bvw1cBfw0ap6ahq1DNX0ywy+jv4SuBr4lar6RNf3UFW9aZnNV12Sd1fVH0zhuL8B7GLwsSsLXfMmBpd43z2l770/ZPB9dyHwVeCVwMcYZFSq6udX7OBV1cQD+Muh5YcW9T08pZr+Ymj5IDDTLV8EPDqlmh5lcKnrhcDzwHd17a8AHplSTQ8B/w3YBlzbPX+xW752il9Tf8HgXe/1wIeBp4B7gJ8HLp7i/98ru+XNwCEGYf8dX29r5QH87ZSO+3ngghHtG4C/mlJNj3TP64EvA+u69az0916fO2PPF0eGzh4+m2Suqg4leT3wzSnV9LIk/4hBWKS6s8Gq+lqSU1Oq6VRVfQv4epK/rqrnu5q+keTbU6ppDvgV4DeB91fVw0m+UVV/OqV6Tquq+jZwH3Bf9/b6BgZniv8FmJlCTeuq6oWuuCeSbAP+KMnrGP2RIysu
ySNLdQGXrmYtQ74NfA/wN4vaX9v1TcPLuhtHL2JwonUJ8AzwcmBFp25aCvqbgN9J8u8ZfEDQnyV5ksGHqt00pZouAQ4z+IKvJK+pqi91c6tT+aYEXkxyYVV9HfjB043d7zGm8g3Qhel/TfLR7vnLrI2vze/4P6qqbzK4s3t/kldMpyS+lOTqqnq4q+mFJG8H9gL/fEo1XQr8GPDsovYAD6x+OQD8KvA/k/wV//+DFWeBfwbcPKWaPgx8jsE76t8EPprkOPAWBlNMK6aZOfrTklwM/BMGQbFQVV+eckkvkeRC4NKq+sIUjv3yqvr7Ee0bgddW1aOrXdOIWn4cuKaq/t2U63h9VX1+mjUslmQTg3dlXxrRd01VfWYKNX0Y+IOq+r8j+u6sqp9d7Zq6Y7+MwUelX8bgh84CcLB7RzsVSb4HoKpOJHkVg9+N/W1V/fmKHre1oJckfadmLq+UJI1m0EtS4wx6SWqcQS9JjTPoJalx/w95BU0LWBrTqwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_tr_te_app_events['date'].value_counts().plot(kind = 'bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### event_id\n",
    "\n",
    "- event_id is unevenly distributed: a few IDs appear more than 200 times, while many appear only once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:38:28.961030Z",
     "start_time": "2021-08-15T08:38:28.505471Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3007317    247\n",
       "2611873    247\n",
       "2793993    245\n",
       "1268100    245\n",
       "3105953    245\n",
       "          ... \n",
       "1456316      1\n",
       "178732       1\n",
       "3110362      1\n",
       "2946474      1\n",
       "1349305      1\n",
       "Name: event_id, Length: 556375, dtype: int64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_tr_te_app_events['event_id'].value_counts()"
   ]
  },
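  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put a number on how many IDs appear only once, the share of single-occurrence values can be computed directly. The sketch below is a minimal standard-library version on made-up data; the helper name `singleton_share` and the sample IDs are assumptions for illustration.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def singleton_share(values):\n",
    "    # Count how often each distinct value occurs.\n",
    "    counts = Counter(values)\n",
    "    # Share of distinct values that occur exactly once.\n",
    "    singles = sum(1 for c in counts.values() if c == 1)\n",
    "    return singles / len(counts)\n",
    "\n",
    "\n",
    "# Hypothetical event_id column: IDs 7 and 9 occur once, ID 6 three times.\n",
    "share = singleton_share([6, 6, 6, 7, 9])\n",
    "print(share)\n",
    "```\n",
    "\n",
    "On the real data the same quantity is `(df_tr_te_app_events['event_id'].value_counts() == 1).mean()`."
   ]
  },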
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### app_id\n",
    "\n",
    "- Smaller app_id values occur more often."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:43:14.460608Z",
     "start_time": "2021-08-15T08:42:52.317471Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:xlabel='app_id', ylabel='Density'>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZ4AAAEHCAYAAACeFSCEAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAspUlEQVR4nO3dfZRc1Xnn+++vql8ktRB6a0AgQCIWsQVOMCggxy+TZV9iiXEsv1zfQLCFiXMVObCS2HdmIsbXmWTGWWMnuVkJCYNMcrHBMcbEDkYTy1exSbBnHN6EIYAADUK2QUIY8SakbqlbXfXcP86u7lKpuqok1Wm1un6ftWrVqX32PuepRvTTe5999lFEYGZmNlEKxzsAMzPrLE48ZmY2oZx4zMxsQjnxmJnZhHLiMTOzCdV1vAOYrObPnx+LFi063mGYmZ1QHnrooZcior9RHSeecSxatIjNmzcf7zDMzE4okn7SrI6H2szMbEI58ZiZ2YRy4jEzswnlxGNmZhPKicfMzCaUE4+ZmU2oXBOPpBWStkraJmldnf2SdH3a/6ikC5u1lfQnkp5K9e+UNLtq33Wp/lZJ76kqv0jSY2nf9ZKU49c2M7MGcks8korADcBKYClwhaSlNdVWAkvSaw1wYwttvwOcHxE/B/wv4LrUZilwOXAesAL4b+k4pOOuqTrXinZ/XzMza02ePZ6LgW0RsT0ihoHbgVU1dVYBt0bmPmC2pAWN2kbEP0bESGp/H7Cw6li3R8RQRPwI2AZcnI43KyLujezhQ7cC78/rS9d6Yc8BVvz593n+tf0TdUozs0ktz8RzBvBc1ecdqayVOq20Bfh14NstHGtHC8fKxZMvvM5TL+zl6Rf3TdQpzcwmtTwTT73rKLWPOx2vTtO2kj4NjABfOdZjVR1zjaTNkjbv3r27XpUjNjCUdc6GR8ptOZ6Z2Ykuz8SzAziz6vNC4PkW6zRsK+kq4L3AlTH27O5Gx1pYp/wwEXFTRCyLiGX9/Q3XuGuZE4+Z2aHyTDwPAkskLZbUQ3bhf0NNnQ3A6jS7bTmwJyJ2NWoraQXwe8D7ImKw5liXS+qVtJhsEsED6Xh7JS1Ps9lWA3fl9q1r7BsqATBcKk3UKc3MJrXcVqeOiBFJ1wKbgCJwc0RskbQ27V8PbAQuI5sIMAhc3ahtOvRfAb3Ad9Ks6PsiYm069h3AE2RDcNdEROW3/SeALwHTya4JVa4L5c49HjOzQ+X6WISI2EiWXKrL1ldtB3BNq21T+RsanO+PgD+qU74ZOL/lwNtoXyXxlOpeVjIz6zheuSBn+9zjMTM7hBNPzjzUZmZ2KCeenDnxmJkdyoknZ2PXeDyrzcwMnHhy52s8ZmaHcuLJ2UDlPh4nHjMzwIknd2NDbU48ZmbgxJO7yuSCIfd4zMyAnG8g7XSlcjA4nA21Pf3Tfdx2/7Oj+37tkrOOV1hmZseVezw5GhgeGd0ulb1ygZkZOPHkqjLMBk48ZmYVTjw5qk48I2Vf4zEzAyeeXFUeiQAw4h6PmRngxJOrSo+np1jwUJuZWeLEk6O9B7LEM6O3yIgfi2BmBjjx5KrS4+nr6fJQm5lZ4sSTo8p06hk9RUqeXGBmBuSceCStkLRV0jZJ6+rsl6Tr0/5HJV3YrK2kD0vaIqksaVlV+ZWSHql6lSVdkPbdk45V2XdKnt+7orJcTl+vezxmZhW5JR5JReAGYCWwFLhC0tKaaiuBJem1BrixhbaPAx8Evl99oIj4SkRcEBEXAB8FfhwRj1RVubKyPyJebNsXbWDfgREKgmndBV/jMTNL8uzxXAxsi4jtETEM3A6sqqmzCrg1MvcBsyUtaNQ2Ip6MiK1Nzn0F8NV2fpmjMTA0Qk9Xga6CZ7WZmVXkmXjOAJ6r+rwjlbVSp5W2jfwqhyeeL6Zhts9IUr1GktZI2ixp8+7du4/gdPXtGyrR21WkWJBvIDUzS/JMPPV+udf+2T9enVba1j+pdAkw
GBGPVxVfGRFvBt6RXh+t1zYiboqIZRGxrL+/v5XTNTQwNEJvV4FiQZQDyuFej5lZnolnB3Bm1eeFwPMt1mml7Xgup6a3ExE70/te4DayobzcDQxniaerkOVRD7eZmeWbeB4ElkhaLKmHLCFsqKmzAVidZrctB/ZExK4W2x5GUgH4MNk1oUpZl6T5absbeC/ZBIXcZT2eohOPmVmV3J7HExEjkq4FNgFF4OaI2CJpbdq/HtgIXAZsAwaBqxu1BZD0AeAvgX7gW5IeiYj3pNO+E9gREdurQukFNqWkUwS+C/x1Xt+7WqkcFApQLGb53VOqzcxyfhBcRGwkSy7VZeurtgO4ptW2qfxO4M5x2twDLK8pGwAuOsLQ26IcIDTa4xnx46/NzLxyQZ5K5UDCQ21mZlWceHJUjqAgUaz0eJx4zMycePI01uPxNR4zswonnhyVI5BEVzENtfkaj5mZE0+eygEF4aE2M7MqTjw5KpWzazxdTjxmZqOceHJUjkCM9Xg8q83MzIknV+VyusbjyQVmZqOceHJUiqBwyH08nlxgZubEk6NygCSKxcrKBe7xmJk58eSoXLNygYfazMyceHI1NtSW/Zg9ucDMzIknV6U0ucD38ZiZjXHiyVFE9gOurFzgx1+bmTnx5KrS4ylIFAQlTy4wM3PiyVPlGg9kN5F6qM3MzIknV5EWCQUnHjOzilwTj6QVkrZK2iZpXZ39knR92v+opAubtZX0YUlbJJUlLasqXyRpv6RH0mt91b6LJD2WjnW9KtkgZ5XHIkA2s803kJqZ5Zh4JBWBG4CVwFLgCklLa6qtBJak1xrgxhbaPg58EPh+ndM+ExEXpNfaqvIb0/Er51px7N+wsYhIq1NnmaerIN9AamZGvj2ei4FtEbE9IoaB24FVNXVWAbdG5j5gtqQFjdpGxJMRsbXVINLxZkXEvRERwK3A+4/1yzVTGVVT1TWeUjjxmJnlmXjOAJ6r+rwjlbVSp5W29SyW9LCk70l6R9U5drRyLElrJG2WtHn37t0tnG585ZRkROrxFN3jMTODfBNPvesotb95x6vTSttau4CzIuItwKeA2yTNOpJjRcRNEbEsIpb19/c3OV1jlVUKCodc43HiMTPryvHYO4Azqz4vBJ5vsU5PC20PERFDwFDafkjSM8C56RwLj+RY7VDp8RQOmdXmyQVmZnn2eB4ElkhaLKkHuBzYUFNnA7A6zW5bDuyJiF0ttj2EpP40KQFJ55BNItiejrdX0vI0m201cFcbv2ddtdd4ujyd2swMyLHHExEjkq4FNgFF4OaI2CJpbdq/HtgIXAZsAwaBqxu1BZD0AeAvgX7gW5IeiYj3AO8E/rOkEaAErI2IV1I4nwC+BEwHvp1euaoMq1VmbncXCxwYOpj3ac3MJr08h9qIiI1kyaW6bH3VdgDXtNo2ld8J3Fmn/BvAN8Y51mbg/COJ/ViVa67xzOgp8tO9ByYyBDOzSckrF+SkMnW60uOZ0VNkcKh0PEMyM5sUnHhyMjadOtPX28VwqczBkicYmFlnc+LJSWUCW2G0x5ONag4Ou9djZp3NiScnpTj8Gg/AwNDI8QrJzGxScOLJSblmVltfr3s8ZmbgxJOb0Ws8qcfTV+nxDLvHY2adzYknJ7VL5syo9Hg81GZmHc6JJydjKxdkmWd6dxEBAx5qM7MO58STk9rp1MWCmNZdZNBDbWbW4Zx4cjI21Da2OHZfb5EB30RqZh3OiScntdd4ILuXxz0eM+t0Tjw5iZprPJDNbPN0ajPrdE48OSnVTKeGbGabbyA1s07nxJOTutd4eooMDJeI8HN5zKxzOfHkJOr0ePp6uyiVg+ERLxRqZp3LiScnow+CYyzzVBYK9b08ZtbJnHhyUrtIKIwtm+OZbWbWyXJNPJJWSNoqaZukdXX2S9L1af+jki5s1lbShyVtkVSWtKyq/FJJD0l6LL2/q2rfPelYj6TXKXl+bxh7LEL1rLbKsjm+l8fMOlluj76WVARu
AC4FdgAPStoQEU9UVVsJLEmvS4AbgUuatH0c+CDwhZpTvgT8SkQ8L+l8YBNwRtX+K9MjsCdEuU6Pp7cry/NDI048Zta5cks8wMXAtojYDiDpdmAVUJ14VgG3RnYl/j5JsyUtABaN1zYinkxlh5wsIh6u+rgFmCapNyKG8vhyzdQ++hqgJyUeP4XUzDpZnkNtZwDPVX3ewaE9kEZ1WmnbyIeAh2uSzhfTMNtnVJu1EklrJG2WtHn37t1HcLrDleusXNBdzH7cntVmZp0sz8RT75d77Q0s49VppW39k0rnAZ8HfrOq+MqIeDPwjvT6aL22EXFTRCyLiGX9/f2tnG5ctatTA/QUKz0e38djZp0rz8SzAziz6vNC4PkW67TS9jCSFgJ3Aqsj4plKeUTsTO97gdvIhgFzNTadekxXMfs07KE2M+tgeSaeB4ElkhZL6gEuBzbU1NkArE6z25YDeyJiV4ttDyFpNvAt4LqI+EFVeZek+Wm7G3gv2QSFXI1NLhhLPQWJ7qI46KE2M+tgLSUeSd+Q9G8ltZyoImIEuJZsdtmTwB0RsUXSWklrU7WNwHZgG/DXwG81apti+YCkHcBbgW9J2pSOdS3wBuAzNdOme4FNkh4FHgF2pnPlqvbR1xXdxYJ7PGbW0Vqd1XYjcDVwvaS/A74UEU81axQRG8mSS3XZ+qrtAK5ptW0qv5NsOK22/LPAZ8cJ5aJmsbbb6FBbTeLp6Sp4VpuZdbSWejAR8d2IuBK4EPgx8B1J/yLp6jR8ZTXqDbVB6vF4qM3MOljLQ2eS5gEfA34DeBj4C7JE9J1cIjvBVTo1tYmnp1jwrDYz62gtDbVJ+nvgjcCXyVYH2JV2fU3ShK0GcCIZvcZTU+5rPGbW6Vq9xvM36ZrLqMqqABGxbLxGnaw87jUe+SmkZtbRWh1qq3fR/t52BjLVlHyNx8ysroY9HkmnkS1VM13SWxgbOZoFzMg5thPa2MoFh5Zn13iceMysczUbansP2YSChcCfVZXvBf5jTjFNCWNDbTU9ni73eMysszVMPBFxC3CLpA9FxDcmKKYpoVRnkVDwrDYzs2ZDbR+JiL8FFkn6VO3+iPizOs2MxvfxHCyViYjDekNmZp2g2VBbX3qfmXcgU81406l7ugoEMDRSZlp3ccLjMjM73poNtX0hvf/hxIQzdZTqPPoaoDutUD04XHLiMbOO1OoioX8saZakbkl3S3pJ0kfyDu5EVu/R1zD2TJ79B30vj5l1plbv4/nliHid7JECO4BzgX+fW1RTQKNZbQD7h0cmPCYzs8mg1cRTWQj0MuCrEfFKTvFMGaVxHosw2uMZ9pRqM+tMrS6Z898lPQXsB35LUj9wIL+wTnzl8viz2gAG3eMxsw7V6mMR1pE9eG1ZRBwEBoBVeQZ2oitFUKy9wEM2qw18jcfMOlerPR6AN5Hdz1Pd5tY2xzNllOPwiQUwNqttvxcKNbMO1eqsti8Dfwq8HfiF9Gq6KrWkFZK2StomaV2d/ZJ0fdr/qKQLm7WV9GFJWySVJS2rOd51qf5WSe+pKr9I0mNp3/WagDs3y+U4bJgNPKvNzKzVHs8yYGl6VHVLJBWBG4BLyWbCPShpQ0Q8UVVtJbAkvS4he8T2JU3aPg58EPhCzfmWApcD5wGnA9+VdG5ElNJx1wD3kT1OewXw7Va/y9EolesPtVVmtfnRCGbWqVqd1fY4cNoRHvtiYFtEbI+IYeB2Dr8utAq4NTL3AbMlLWjUNiKejIitdc63Crg9PSPoR8A24OJ0vFkRcW9KnLcC7z/C73LEygHFBj2eA+7xmFmHarXHMx94QtIDwFClMCLe16DNGcBzVZ93kPVqmtU5o8W29c53X51jHUzbteWHkbSGrGfEWWed1eR0jZUjDptKDdWz2px4zKwztZp4/uAojl3vOkrtUN14dVpp2+r5Wj5WRNwE3ASwbNmyY1pCeryhtmJBFAvyNR4z61gtJZ6I+J6ks4ElEfFdSTOAZguN7QDO
rPq8EHi+xTo9LbRt9Xw70vaRHOuYlceZTg3ZzDbPajOzTtXqrLb/E/g6Yxf0zwC+2aTZg8ASSYsl9ZBd+N9QU2cDsDrNblsO7ImIXS22rbUBuFxSr6TFZBMWHkjH2ytpeZrNthq4q4WvfUzKDR570FMsOPGYWcdqdajtGrIL/vcDRMTTkk5p1CAiRiRdC2wi6x3dHBFbJK1N+9eTzTC7jGwiwCBwdaO2AJI+APwl0A98S9IjEfGedOw7gCeAEeCaNKMN4BPAl4DpZLPZcp3RBmmobZzE010sMOihNjPrUK0mnqGIGK78BZ9uIm16DSQiNpIll+qy9VXbQZbUWmqbyu8E7hynzR8Bf1SnfDNwfrN426lUZtyhtp4u93jMrHO1Op36e5L+IzBd0qXA3wH/Pb+wTnwxzqw2yHo8+w96rTYz60ytJp51wG7gMeA3yXoi/3deQU0F463VBu7xmFlna3VWW1nSN4FvRsTufEOaGppe43HiMbMO1bDHk2ab/YGkl4CngK2Sdkv6/YkJ78QVAYVxejzTugrsG/JQm5l1pmZDbb8LvA34hYiYFxFzyVYQeJukT+Yd3ImsVI66q1MD9PV28crA8MQGZGY2STRLPKuBK9LaZwBExHbgI2mfjaMU9VenhizxDA6XfJ3HzDpSs8TTHREv1Ram6zzddepbUh5nyRyAvp5s0YeXB4bq7jczm8qaJZ5G40EeK2qg3KDHM7M3m9Px8j7/CM2s8zSb1fbzkl6vUy5gWg7xTBmlBpML+iqJxz0eM+tADRNPRDRbCNTGUS4HxQaTC8A9HjPrTK3eQGpHqNHq1KNDbZ7ZZmYdyIknJ6Vyg9WpuwpM6y54SrWZdSQnnpyUY/yVCwDm9fXy0j5f4zGzzuPEk5PxnkBaMW9mj6/xmFlHcuLJSTkYd3VqgHl9PR5qM7OO5MSTk0aTCwDmzezlZQ+1mVkHyjXxSFohaaukbZLW1dkvSden/Y9KurBZW0lzJX1H0tPpfU4qv1LSI1WvsqQL0r570rEq+xo+PbUdGq1ODVmP5+WBYbJn4ZmZdY7cEo+kInADsBJYClwhaWlNtZXAkvRaA9zYQtt1wN0RsQS4O30mIr4SERdExAXAR4EfR8QjVee6srI/Il5s9/etVW5wAylk13iGRsoMeL02M+swefZ4Lga2RcT2iBgGbgdW1dRZBdwamfuA2ZIWNGm7Crglbd8CvL/Oua8AvtrWb3OEyg1WpwaY29cL4OE2M+s4eSaeM4Dnqj7vSGWt1GnU9tSI2AWQ3usNm/0qhyeeL6Zhts9ovBts2qjRE0gh6/GAbyI1s86TZ+Kp91u39oLGeHVaaVv/pNIlwGBEPF5VfGVEvBl4R3p9dJy2ayRtlrR59+5je9Bqo0VCAeaP9niceMyss+SZeHYAZ1Z9Xgg832KdRm1/mobjSO+112sup6a3ExE70/te4DayobzDRMRNEbEsIpb19/c3/HLNZENt4yeeU2ZliWfXnv3HdB4zsxNNnonnQWCJpMWSesgSwoaaOhuA1Wl223JgTxo+a9R2A3BV2r4KuKtyMEkF4MNk14QqZV2S5qftbuC9QHVvKBfNhtpOOamXk6d389QLe/MOxcxsUmn2WISjFhEjkq4FNgFF4OaI2CJpbdq/HtgIXAZsAwaBqxu1TYf+HHCHpI8Dz5Ilmop3AjvSU1IreoFNKekUge8Cf53Hd65WLtOwxyOJN552Ek/tqvfUCTOzqSu3xAMQERvJkkt12fqq7QCuabVtKn8ZePc4be4BlteUDQAXHWHoxyy7xtO4zpsWzOKOzc9lw3LNKpuZTRFeuSAnzdZqA3jjaScxOFziuVcHJygqM7Pjz4knJ+Vo3ot504JZADy5y9d5zKxzOPHkpBw0XDIH4NxTT0KCJ32dx8w6iBNPTkpNVi4AmN5TZPG8Pp56wYnHzDqHE09OWp0w8MYFJ3lKtZl1FCeenJSaPIG0YvH8Pna+up+RUnkCojIzO/5ynU7dyZpN
Lrjt/mcB2PnqAUbKwU3f387sGT382iVnTVSIZmbHhXs8OWl2A2nFnBndALw6eDDvkMzMJgUnnpxkS+Y0rzd7RrZK9WuDXizUzDqDE09Oyi1e45k92uNx4jGzzuDEk4OIICJbj62Z7mKBmb1dvOahNjPrEE48OSiVs0cHNVsyp2LOjG73eMysYzjx5CDlnZYTz+wZPe7xmFnHcOLJQTmyzNPqA7bnzOjmtf0HR9uZmU1lTjw5GB1qazHzzJ7RQ6kc7DswkmdYZmaTghNPDkpx5Nd4wDPbzKwzOPHkINLqN63MaoPqe3l8ncfMpr5cE4+kFZK2StomaV2d/ZJ0fdr/qKQLm7WVNFfSdyQ9nd7npPJFkvZLeiS91le1uUjSY+lY16vVjHCURns8LZ5l9vSsx7NnvxOPmU19uSUeSUXgBmAlsBS4QtLSmmorgSXptQa4sYW264C7I2IJcHf6XPFMRFyQXmurym9Mx6+ca0XbvmgdRzqduqerQHdR7BvyNR4zm/ry7PFcDGyLiO0RMQzcDqyqqbMKuDUy9wGzJS1o0nYVcEvavgV4f6Mg0vFmRcS9ERHArc3aHKtIPZ5WHosA2ZDczN4uJx4z6wh5Jp4zgOeqPu9IZa3UadT21IjYBZDeT6mqt1jSw5K+J+kdVefY0SQOACStkbRZ0ubdu3c3+37jqgy1tbJIaIUTj5l1ijwTT73furU3qoxXp5W2tXYBZ0XEW4BPAbdJmnUkx4qImyJiWUQs6+/vb3K68R3pdGpIicfTqc2sA+SZeHYAZ1Z9Xgg832KdRm1/mobPKsNoLwJExFBEvJy2HwKeAc5Nx1rYJI62KqdZba0OtQH0ucdjZh0iz8TzILBE0mJJPcDlwIaaOhuA1Wl223JgTxo+a9R2A3BV2r4KuAtAUn+alICkc8gmEWxPx9sraXmazba60iYv5dGhttbbzJzWxcDQyGhvycxsqsrtCaQRMSLpWmATUARujogtktam/euBjcBlwDZgELi6Udt06M8Bd0j6OPAs8OFU/k7gP0saAUrA2oh4Je37BPAlYDrw7fTKzZHeQArZUFuQ3UQ6f2ZvTpGZmR1/uT76OiI2kiWX6rL1VdsBXNNq21T+MvDuOuXfAL4xzrE2A+cfSezHolw+uskFAC/vc+Ixs6nNKxfk4EhXp4ZsqA3gpX1DeYRkZjZpOPHkoFQ+ims8PU48ZtYZnHhyUD6a+3hSj2f3XiceM5vanHhycKRL5gBM7y5SlHhpn1eoNrOpzYknB0fT45FEX2/RQ21mNuU58eSgfIRrtVXMnNbFy048ZjbFOfHkoJRWLjiSJXMgm1LtoTYzm+qceHIw1uM5snZZ4nGPx8ymNieeHBzNDaQAJ0/v4cW9Q34gnJlNaU48OTiaJXMA3njaSZTKwd1P/jSPsMzMJgUnnhxUVi440h7PwjnTOf3kaWx87IUcojIzmxyceHJQPoqVCyCbUv2e80/j+0/vZu8BD7eZ2dTkxJODo7mBtOKyNy9geKTMPz31YrvDMjObFJx4cjA0ks2nntZdPOK2F501h1NO6uXbHm4zsynKiScHA+lJojN6jjzxFApixfmn8c9bXxw9jpnZVOLEk4OB4Sxh9PUc3eOOVp6/gKGRMvds3d3OsMzMJgUnnhwMDpcAmNF75D0egIsXz2VeXw8bH9/VzrDMzCaFXBOPpBWStkraJmldnf2SdH3a/6ikC5u1lTRX0nckPZ3e56TySyU9JOmx9P6uqjb3pGM9kl6n5Pm9B4ZG6C6K3q6jSzzFgvjl807jn596kcFhD7eZ2dSSW+KRVARuAFYCS4ErJC2tqbYSWJJea4AbW2i7Drg7IpYAd6fPAC8BvxIRbwauAr5cc64rI+KC9Mp1ytjA0AgzjnKY7bb7n+W2+59lZm8Xg8Ml/tNdW7jt/mfbHKGZ2fGTZ4/nYmBbRGyPiGHgdmBVTZ1VwK2RuQ+YLWlBk7argFvS9i3A+wEi4uGIeD6VbwGmSerN6bs1
NDBcou8oJhZUO3veDOb29fDQs6+2KSozs8khz8RzBvBc1ecdqayVOo3anhoRuwDSe71hsw8BD0dE9YqbX0zDbJ+R6i8pIGmNpM2SNu/effQX9geHR5jRe3Q9noqCxFvOms323QO8OugVq81s6sgz8dT75R4t1mmlbf2TSucBnwd+s6r4yjQE9470+mi9thFxU0Qsi4hl/f39rZyuroGhY+/xAFx45hwAbv6fP+IPNmzhxb0HjvmYZmbHW56JZwdwZtXnhcDzLdZp1PanaTiO9D56vUbSQuBOYHVEPFMpj4id6X0vcBvZUF5uBodH6DvGHg/AnL4ePnThQub29XDb/c/y7j/9Hv/sFQ3M7ASXZ+J5EFgiabGkHuByYENNnQ3A6jS7bTmwJw2fNWq7gWzyAOn9LgBJs4FvAddFxA8qJ5DUJWl+2u4G3gs83vZvW2XfUOmoJxfUuujsOVz9tsVs+uQ76Z/Vyx9v2tqW45qZHS+5JZ6IGAGuBTYBTwJ3RMQWSWslrU3VNgLbgW3AXwO/1ahtavM54FJJTwOXps+k+m8APlMzbboX2CTpUeARYGc6V26yHs+xD7VVWzy/j6veuognd73OUy+83tZjm5lNpPb8WT6OiNhIllyqy9ZXbQdwTattU/nLwLvrlH8W+Ow4oVzUetTHbqCNPZ5qv/Lzp/Nf/uEJ7vzhTq67bFbbj29mNhFyTTydanB4pC2TC6pV7uV5wykz+eoDz3Lm3BkUJH7tkrPaeh4zs7x5yZw2K5eDweHSMU+nHs+FZ83h9QMjPLXLw21mdmJy4mmzwYPZOm0z23yNp+JNC2YxZ0Y333/6JSJammFuZjapOPG02eDoIxHy6fEUC+LtS/p59pVBtv50L/vTgqRmZicKJ542G0iJoN2z2qpddNYcZvQUufXen/Dzf/iPfOX+nwDw4t4DfP2hHfzLtpdyO7eZ2bHy5II2G8i5xwPQ01XgE//mZ9j+0gCP79zDp+98nP+68Sn2pXN3F8Wm330n5/TPzC0GM7Oj5cTTZpVn8RztQ+BaNW9mL/Nm9nLR2XO4Z+tuXto3xIKTp7Hg5Onc9sBP+Pdff5Q7fvOtFAt1l6UzMztunHjarNLjyXOorVpB4l1vPHSd1F/5udP5u4d28NlvPcHvv3cp46yJamZ2XDjxtNnoY69zmk7digvOnM3MaV188Qc/pqerwP916c/S0+XLeWY2OTjxtNngUHrsdZtvID0SkviZ/pksO3sOX/jedr6+eQenz57O3L4ePv+hn+O0k6cdt9jMzPxncJuN9nhyvsbTTEHigxcu5Kq3ns3M3i6ef20//+Pp3bzjj/+Jux7ZeVxjM7PO5h5Pm1UmF8yYoGs8zfzsabP42dOydd1eGRjmnq0vsu4bj3He6bN4wyknHefozKwTucfTZvuGRuguit6uyZF4qs3t6+GGKy9kRk+RtX/7Q17elz2g9cldr3Pd3z/G9Xc/zUv7hpocxczs2LjH02aDQyO53sNzrO5+8kU+8JYzuOXeH7Piz/8HJ03vYvvuAaZ1FzhwsMz67z3D5z70c/ybc/t5Yc8Bzj11pmfFmVlbTd7fkCeogeH2PPY6T+f0z+SqX1zEl+/9CSPlMpcuPZXli+exd+ggdz68k9/+6sOI7FnjF5w5m09dei7vWDLfCcjM2sKJp80Gh0dyW5m6nc6ZP5PrVr6JrqIopIQyvafIb7z9HH6w7SWGS2Wmdxd55LnXWH3zA/xMfx8v7Rump6vAsrPnIMHM3i7evHA2bz7jZE6fPY2hg2UOHCyx/2CJgaESg8MjnDStm19YNMdJy8xGTf7fkCeYgaHJ3+OpqHdvT7Eg3nlu/+jnS86Zy0M/eZXHdu7h3FNnMjxS5oEfvUKhIAaGRrhj846m5+k/qZeP/eKilLDEjJ4i07qL7Nl/kMd37mH77n2ce9pJnDN/JqfO6uXseX1eccFsCss18UhaAfwFUAT+JiI+V7Nfaf9lwCDwsYj4YaO2kuYCXwMW
AT8G/o+IeDXtuw74OFACfjsiNqXyi4AvAdPJnmr6O5HTMwUGhkaO682j7dZVKHDJ4nlcsnjeYfsigtf2H2Tnq/vTpIoC3UXRXSzQ21Wgp6vA7r1D3Lf9Zf5k09Zxz9FdFAdLY/85ersKnHf6LM6e18dZc2dwyqxeeooF9h4YoSBYNL+Pc+bPZN7MHgaHS+wfLjF4cITB4dLofVRnzp3O9J4ihZTopncX3esymyRy+w0pqQjcAFwK7AAelLQhIp6oqrYSWJJelwA3Apc0absOuDsiPidpXfr8e5KWApcD5wGnA9+VdG5ElNJx1wD3kSWeFcC38/jeA8MlZs/oyePQk44k5szoYU6D77twzgzectYcXj9wkBf2HKAgMTxSZrhUZlp3gVNOmsacGd28OniQVweHeW3wIDtfG+TF14fYvvtF9uw/SDv+QpCye6tm9BTpLhaQsjIhpOy+J5H1+M6cO4PTTp5GVyEbhiwWsle2DUWJQkFj79Xbyo5RSD22iOxaGRGUI7vPq1QKursK9BQL6T1L1l3FAsV0jsp5R88jUVD2My+kc1Rvj9aX0vfKvh+Mfa78HCvfs1gQXYXC6HZRIojRmCPikJ+90n/zse3s54cq5xqrc8h+qvZX/czr1afqHDZ15fmn+cXAtojYDiDpdmAVUJ14VgG3pt7HfZJmS1pA1psZr+0q4JdS+1uAe4DfS+W3R8QQ8CNJ24CLJf0YmBUR96Zj3Qq8n5wSz3mnz2LhnOl5HPqENmtaN7OmdY+7f25fD3P7sgR20dlzRstHSmUGh0uMlINp3QVK5eClfcO8vG+IweES3V0Fekd/gWe9rIjg1cFhDpayX5wHR8oMjZQZHikxNFKmVM5+nVZ+qVZ+wUZAqRw88fzr3P+jV4gIyilhRM275a86/9SmourkdPi+6nY1e8c55lgiVJ2yBuesqVNdr7b9Ifvqxquaz9XnPrL4ah1yzHF+PtXl//jJd+Z6S0ieiecM4LmqzzvIejXN6pzRpO2pEbELICJ2SaqskHkGWY+m9lgH03Zt+WEkrSHrGQHskzT++FATnxzbnA+cKA/IOZFihRMrXseanxMp3hMi1mn/ATj6WM9uViHPxFMv/db+rThenVbatnq+lo8VETcBNzU5zxGRtDkilrXzmHk5kWKFEytex5qfEylex5rJc+WCHcCZVZ8XAs+3WKdR25+m4TjS+4stHGthkzjMzGyC5Jl4HgSWSFosqYfswv+GmjobgNXKLAf2pGG0Rm03AFel7auAu6rKL5fUK2kx2YSFB9Lx9kpanmbRra5qY2ZmEyy3obaIGJF0LbCJbEr0zRGxRdLatH892Qyzy4BtZNOpr27UNh36c8Adkj4OPAt8OLXZIukOsgkII8A1aUYbwCcYm079bXKaWDCOtg7d5exEihVOrHgda35OpHgdK6CcbmcxMzOry6tTm5nZhHLiMTOzCeXEkxNJKyRtlbQtrbBwPGI4U9I/S3pS0hZJv5PK50r6jqSn0/ucqjbXpZi3SnpPVflFkh5L+65XTreXSypKeljSP5wAsc6W9HVJT6Wf8Vsna7ySPpn+DTwu6auSpk2mWCXdLOlFSY9XlbUtvjTp6Gup/H5Ji9oc65+kfwePSrpT0uzJEOt48Vbt+3eSQtL8CY03Ivxq84tsQsQzwDlAD/CvwNLjEMcC4MK0fRLwv4ClwB8D61L5OuDzaXtpirUXWJy+QzHtewB4K9l9Ud8GVuYU86eA24B/SJ8nc6y3AL+RtnuA2ZMxXrIbpn8ETE+f7wA+NpliBd4JXAg8XlXWtviA3wLWp+3Lga+1OdZfBrrS9ucnS6zjxZvKzySbwPUTYP5Extv2/xn9CtJ/nE1Vn68DrpsEcd1Ftv7dVmBBKlsAbK0XZ/pH+dZU56mq8iuAL+QQ30LgbuBdjCWeyRrrLLJf5qopn3TxMrYSyFyymaz/kH5RTqpYyZbKqv5l3rb4KnXSdhfZHflqV6w1+z4AfGWyxDpevMDX
gZ8nW2x5/kTG66G2fIy3FNBxk7q/bwHup2bZIaB62aHxljBqadmhY/TnwH8AylVlkzXWc4DdwBfT0ODfSOqbjPFGxE7gT8luP9hFdr/cP07GWGu0M77RNhExAuwBDl9yvT1+nbFbNiZlrJLeB+yMiH+t2TUh8Trx5ONolvzJjaSZwDeA342I1xtVrVN2tEsYHRFJ7wVejIiHWm1Sp2xCYk26yIYvboyItwADZMNB4zmeP9s5ZIvoLiZbub1P0kcaNRknpsny7/po4puQ2CV9muw+wq80Oe9xi1XSDODTwO/X2z3OudsarxNPPlpZLmhCSOomSzpfiYi/T8WTcdmhtwHvU7aa+O3AuyT97SSNtXL+HRFxf/r8dbJENBnj/d+AH0XE7og4CPw98IuTNNZq7YxvtI2kLuBk4JV2BivpKuC9wJWRxp0maaw/Q/ZHyL+m/98WAj+UdNpExevEk49WlgvKXZp18v8CT0bEn1XtmnTLDkXEdRGxMCIWkf28/ikiPjIZY03xvgA8J+lnU9G7yVbNmIzxPgsslzQjnePdwJOTNNZq7Yyv+lj/O9m/r3b2IlaQPZ7lfRExWPMdJlWsEfFYRJwSEYvS/287yCYhvTBh8R7LBSu/Gl7Mu4xsFtkzwKePUwxvJ+vyPgo8kl6XkY2/3g08nd7nVrX5dIp5K1UzloBlwONp319xjBc7m8T9S4xNLpi0sQIXAJvTz/ebwJzJGi/wh8BT6TxfJpu1NGliBb5Kdv2p8hiTj7czPmAa8Hdky3M9AJzT5li3kV3nqPx/tn4yxDpevDX7f0yaXDBR8XrJHDMzm1AeajMzswnlxGNmZhPKicfMzCaUE4+ZmU0oJx4zM5tQTjxmZjahnHjMphBJ79M4j+GQtG+i4zGrx/fxmHUISfsiYubxjsPMPR6z40DSNyU9pOzhbGtS2T5J/4+kH0q6W1J/Kr9H0p9L+hdlD3K7uMFxPybpr9L2Ykn3SnpQ0n+ZmG9m1pwTj9nx8esRcRHZMiS/LWke0Af8MCIuBL4H/Keq+n0R8YtkD926ucVz/AXZ6tm/ALzQvtDNjo0Tj9nx8duS/hW4j2xl3yVkzyH6Wtr/t2Rr7VV8FSAivg/MUtWjlRt4W6Ud2fpsZpNC1/EOwKzTSPolskcVvDUiBiXdQ7bQYq0YZ7ve5/H4Iq5NOu7xmE28k4FXU9J5I7A8lRfIlpUH+DXgf1a1+VUASW8ne4LonhbO8wOyR0wAXHnMUZu1iXs8ZhPv/wPWSnqUbOn5+1L5AHCepIfIHh/8q1VtXpX0L8Asskcrt+J3gNsk/Q7ZwwDNJgVPpzabJMab7pyG4v5dRGye+KjM2s9DbWZmNqHc4zE7AUm6mmwordoPIuKa4xGP2ZFw4jEzswnloTYzM5tQTjxmZjahnHjMzGxCOfGYmdmE+v8B7W7UiphdNIgAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.distplot(df_tr_te_app_events['app_id'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:43:53.089075Z",
     "start_time": "2021-08-15T08:43:52.772735Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8        433103\n",
       "2        338678\n",
       "30       304719\n",
       "84       239018\n",
       "3        147057\n",
       "          ...  \n",
       "11906         1\n",
       "13439         1\n",
       "13437         1\n",
       "13434         1\n",
       "9513          1\n",
       "Name: app_id, Length: 13762, dtype: int64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_tr_te_app_events['app_id'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-08T13:01:16.716204Z",
     "start_time": "2021-08-08T13:01:16.697187Z"
    }
   },
   "source": [
    "### 小结\n",
    "\n",
    "从上面的基础分析来看，本次赛题的数据基本符合我们的直观理解。至于其它的细节，大家可以根据自己的需求进行分析。下面我们构建该赛题的Baseline。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-08T12:19:43.122811Z",
     "start_time": "2021-08-08T12:19:43.093323Z"
    }
   },
   "source": [
    "## 模型构建\n",
    "### 特征工程\n",
    "\n",
    "#### tag_list的长度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:50:50.227888Z",
     "start_time": "2021-08-15T08:50:45.192787Z"
    }
   },
   "outputs": [],
   "source": [
    "df_tr_te_app_events['tag_list_len'] = df_tr_te_app_events['tag_list'].apply(lambda x: x.count(',')+1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 基础特征工程\n",
    "\n",
    "- event_id：出现次数&不同值；\n",
    "- app_id：出现的不同值；\n",
    "- is_installed：均值和和；\n",
    "- is_active：均值和和；\n",
    "- date：最大值，不同值；\n",
    "- tag_list_len：均值和标准差"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:57:52.193172Z",
     "start_time": "2021-08-15T08:57:43.031033Z"
    }
   },
   "outputs": [],
   "source": [
    "agg_dic = {\n",
    "    \"event_id\":['count','nunique'],\n",
    "    \"app_id\":['nunique'], \n",
    "    \"is_installed\":[np.mean,np.sum],\n",
    "    \"is_active\":[np.mean,np.sum],\n",
    "    \"date\":[np.max, 'nunique'],\n",
    "    \"tag_list_len\":[np.mean, np.std] \n",
    "}\n",
    "\n",
    "df_device_features = df_tr_te_app_events.groupby('device_id').agg(agg_dic).reset_index()\n",
    "\n",
    "fea_names = ['_'.join(c) for c in df_device_features.columns]\n",
    "df_device_features.columns = fea_names\n",
    "df_device_features.rename(columns = {'device_id_':'device_id'},inplace = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T08:58:43.056767Z",
     "start_time": "2021-08-15T08:58:43.021297Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>device_id</th>\n",
       "      <th>event_id_count</th>\n",
       "      <th>event_id_nunique</th>\n",
       "      <th>app_id_nunique</th>\n",
       "      <th>is_installed_mean</th>\n",
       "      <th>is_installed_sum</th>\n",
       "      <th>is_active_mean</th>\n",
       "      <th>is_active_sum</th>\n",
       "      <th>date_amax</th>\n",
       "      <th>date_nunique</th>\n",
       "      <th>tag_list_len_mean</th>\n",
       "      <th>tag_list_len_std</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>53</td>\n",
       "      <td>1</td>\n",
       "      <td>53</td>\n",
       "      <td>1</td>\n",
       "      <td>53</td>\n",
       "      <td>0.113208</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>5.188679</td>\n",
       "      <td>4.578615</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>81</td>\n",
       "      <td>3</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>81</td>\n",
       "      <td>0.456790</td>\n",
       "      <td>37</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>5.049383</td>\n",
       "      <td>3.488915</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>154</td>\n",
       "      <td>11</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>154</td>\n",
       "      <td>0.727273</td>\n",
       "      <td>112</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>6.201299</td>\n",
       "      <td>3.941367</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>109</td>\n",
       "      <td>5</td>\n",
       "      <td>24</td>\n",
       "      <td>1</td>\n",
       "      <td>109</td>\n",
       "      <td>0.587156</td>\n",
       "      <td>64</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>5.119266</td>\n",
       "      <td>3.338121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>303</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "      <td>1</td>\n",
       "      <td>303</td>\n",
       "      <td>0.273927</td>\n",
       "      <td>83</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>6.260726</td>\n",
       "      <td>4.535512</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   device_id  event_id_count  event_id_nunique  app_id_nunique  \\\n",
       "0          0              53                 1              53   \n",
       "1          1              81                 3              30   \n",
       "2          2             154                11              20   \n",
       "3          3             109                 5              24   \n",
       "4          4             303                 6              96   \n",
       "\n",
       "   is_installed_mean  is_installed_sum  is_active_mean  is_active_sum  \\\n",
       "0                  1                53        0.113208              6   \n",
       "1                  1                81        0.456790             37   \n",
       "2                  1               154        0.727273            112   \n",
       "3                  1               109        0.587156             64   \n",
       "4                  1               303        0.273927             83   \n",
       "\n",
       "   date_amax  date_nunique  tag_list_len_mean  tag_list_len_std  \n",
       "0          1             1           5.188679          4.578615  \n",
       "1          6             2           5.049383          3.488915  \n",
       "2          5             3           6.201299          3.941367  \n",
       "3          5             4           5.119266          3.338121  \n",
       "4          6             4           6.260726          4.535512  "
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_device_features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 特征组合\n",
    "\n",
    "- 常用的count/nunqiue & mean/std组合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:02:05.594075Z",
     "start_time": "2021-08-15T09:02:05.532367Z"
    }
   },
   "outputs": [],
   "source": [
    "df_device_features['event_id_count_nunique_ratio'] = df_device_features['event_id_count'] / df_device_features['event_id_nunique']\n",
    "df_device_features['tag_list_len_mean_div_std'] = df_device_features['tag_list_len_mean'] / df_device_features['tag_list_len_std']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-08T12:23:49.840993Z",
     "start_time": "2021-08-08T12:23:49.838892Z"
    }
   },
   "source": [
    "#### 特征拼接"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:03:07.521832Z",
     "start_time": "2021-08-15T09:03:07.422558Z"
    }
   },
   "outputs": [],
   "source": [
    "df_tr = df_tr.merge(df_device_features, on = 'device_id' , how='left')\n",
    "df_te = df_te.merge(df_device_features, on = 'device_id' , how='left')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:03:40.865237Z",
     "start_time": "2021-08-15T09:03:38.462006Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3167"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "del df_tr_te_app_events\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:04:12.802182Z",
     "start_time": "2021-08-15T09:04:12.782471Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>device_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>phone_brand</th>\n",
       "      <th>device_model</th>\n",
       "      <th>event_id_count</th>\n",
       "      <th>event_id_nunique</th>\n",
       "      <th>app_id_nunique</th>\n",
       "      <th>is_installed_mean</th>\n",
       "      <th>is_installed_sum</th>\n",
       "      <th>is_active_mean</th>\n",
       "      <th>is_active_sum</th>\n",
       "      <th>date_amax</th>\n",
       "      <th>date_nunique</th>\n",
       "      <th>tag_list_len_mean</th>\n",
       "      <th>tag_list_len_std</th>\n",
       "      <th>event_id_count_nunique_ratio</th>\n",
       "      <th>tag_list_len_mean_div_std</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>53</td>\n",
       "      <td>1</td>\n",
       "      <td>53</td>\n",
       "      <td>1</td>\n",
       "      <td>53</td>\n",
       "      <td>0.113208</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>5.188679</td>\n",
       "      <td>4.578615</td>\n",
       "      <td>53.0</td>\n",
       "      <td>1.133242</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>81</td>\n",
       "      <td>3</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>81</td>\n",
       "      <td>0.456790</td>\n",
       "      <td>37</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>5.049383</td>\n",
       "      <td>3.488915</td>\n",
       "      <td>27.0</td>\n",
       "      <td>1.447264</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>154</td>\n",
       "      <td>11</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>154</td>\n",
       "      <td>0.727273</td>\n",
       "      <td>112</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>6.201299</td>\n",
       "      <td>3.941367</td>\n",
       "      <td>14.0</td>\n",
       "      <td>1.573388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>109</td>\n",
       "      <td>5</td>\n",
       "      <td>24</td>\n",
       "      <td>1</td>\n",
       "      <td>109</td>\n",
       "      <td>0.587156</td>\n",
       "      <td>64</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>5.119266</td>\n",
       "      <td>3.338121</td>\n",
       "      <td>21.8</td>\n",
       "      <td>1.533577</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>75</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>303</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "      <td>1</td>\n",
       "      <td>303</td>\n",
       "      <td>0.273927</td>\n",
       "      <td>83</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>6.260726</td>\n",
       "      <td>4.535512</td>\n",
       "      <td>50.5</td>\n",
       "      <td>1.380379</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   device_id  gender  age  phone_brand  device_model  event_id_count  \\\n",
       "0          0       0   35            0             0              53   \n",
       "1          1       1   37            1             1              81   \n",
       "2          2       0   32            1             2             154   \n",
       "3          3       1   28            1             2             109   \n",
       "4          4       0   75            2             3             303   \n",
       "\n",
       "   event_id_nunique  app_id_nunique  is_installed_mean  is_installed_sum  \\\n",
       "0                 1              53                  1                53   \n",
       "1                 3              30                  1                81   \n",
       "2                11              20                  1               154   \n",
       "3                 5              24                  1               109   \n",
       "4                 6              96                  1               303   \n",
       "\n",
       "   is_active_mean  is_active_sum  date_amax  date_nunique  tag_list_len_mean  \\\n",
       "0        0.113208              6          1             1           5.188679   \n",
       "1        0.456790             37          6             2           5.049383   \n",
       "2        0.727273            112          5             3           6.201299   \n",
       "3        0.587156             64          5             4           5.119266   \n",
       "4        0.273927             83          6             4           6.260726   \n",
       "\n",
       "   tag_list_len_std  event_id_count_nunique_ratio  tag_list_len_mean_div_std  \n",
       "0          4.578615                          53.0                   1.133242  \n",
       "1          3.488915                          27.0                   1.447264  \n",
       "2          3.941367                          14.0                   1.573388  \n",
       "3          3.338121                          21.8                   1.533577  \n",
       "4          4.535512                          50.5                   1.380379  "
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_tr.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:04:29.314137Z",
     "start_time": "2021-08-15T09:04:29.308586Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['device_id', 'gender', 'age', 'phone_brand', 'device_model',\n",
       "       'event_id_count', 'event_id_nunique', 'app_id_nunique',\n",
       "       'is_installed_mean', 'is_installed_sum', 'is_active_mean',\n",
       "       'is_active_sum', 'date_amax', 'date_nunique', 'tag_list_len_mean',\n",
       "       'tag_list_len_std', 'event_id_count_nunique_ratio',\n",
       "       'tag_list_len_mean_div_std'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_tr.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-08T12:24:31.990819Z",
     "start_time": "2021-08-08T12:24:31.988238Z"
    }
   },
   "source": [
    "## 模型训练&预测\n",
    "\n",
    "### 特征&标签设计"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:05:32.924409Z",
     "start_time": "2021-08-15T09:05:32.897984Z"
    }
   },
   "outputs": [],
   "source": [
    "tr_features = ['device_id',  'phone_brand', 'device_model', 'event_id_count', 'event_id_nunique', 'app_id_nunique',\n",
    "       'is_installed_mean', 'is_installed_sum', 'is_active_mean', 'is_active_sum', 'date_amax', 'date_nunique', 'tag_list_len_mean',\n",
    "       'tag_list_len_std', 'event_id_count_nunique_ratio','tag_list_len_mean_div_std']\n",
    "label_gender   = 'gender'\n",
    "label_age      = 'age'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### gender模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:12:59.135252Z",
     "start_time": "2021-08-15T09:12:59.129781Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2668\n",
       "1    1347\n",
       "Name: gender, dtype: int64"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_valid.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:19:26.153937Z",
     "start_time": "2021-08-15T09:19:20.704007Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/wangrong/opt/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument\n",
      "  _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[LightGBM] [Info] Number of positive: 5742, number of negative: 10318\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002834 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3032\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.357534 -> initscore=-0.586082\n",
      "[LightGBM] [Info] Start training from score -0.586082\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's binary_error: 0.349004\tvalid_1's binary_error: 0.343711\n",
      "[200]\ttraining's binary_error: 0.325903\tvalid_1's binary_error: 0.341968\n",
      "Early stopping, best iteration is:\n",
      "[138]\ttraining's binary_error: 0.341034\tvalid_1's binary_error: 0.340722\n",
      "[LightGBM] [Info] Number of positive: 5581, number of negative: 10479\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001524 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3030\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.347509 -> initscore=-0.630005\n",
      "[LightGBM] [Info] Start training from score -0.630005\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's binary_error: 0.338667\tvalid_1's binary_error: 0.382316\n",
      "[200]\ttraining's binary_error: 0.322354\tvalid_1's binary_error: 0.378082\n",
      "[300]\ttraining's binary_error: 0.306476\tvalid_1's binary_error: 0.376837\n",
      "Early stopping, best iteration is:\n",
      "[280]\ttraining's binary_error: 0.310772\tvalid_1's binary_error: 0.374595\n",
      "[LightGBM] [Info] Number of positive: 5715, number of negative: 10345\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001416 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3037\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.355853 -> initscore=-0.593409\n",
      "[LightGBM] [Info] Start training from score -0.593409\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's binary_error: 0.349564\tvalid_1's binary_error: 0.344956\n",
      "[200]\ttraining's binary_error: 0.330573\tvalid_1's binary_error: 0.347198\n",
      "Early stopping, best iteration is:\n",
      "[140]\ttraining's binary_error: 0.34396\tvalid_1's binary_error: 0.34396\n",
      "[LightGBM] [Info] Number of positive: 5681, number of negative: 10379\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000921 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3036\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.353736 -> initscore=-0.602657\n",
      "[LightGBM] [Info] Start training from score -0.602657\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's binary_error: 0.35193\tvalid_1's binary_error: 0.356912\n",
      "[200]\ttraining's binary_error: 0.331756\tvalid_1's binary_error: 0.353176\n",
      "Early stopping, best iteration is:\n",
      "[132]\ttraining's binary_error: 0.340722\tvalid_1's binary_error: 0.34944\n",
      "[LightGBM] [Info] Number of positive: 5777, number of negative: 10283\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000947 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3038\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.359714 -> initscore=-0.576608\n",
      "[LightGBM] [Info] Start training from score -0.576608\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's binary_error: 0.352927\tvalid_1's binary_error: 0.335243\n",
      "Early stopping, best iteration is:\n",
      "[85]\ttraining's binary_error: 0.356413\tvalid_1's binary_error: 0.332752\n"
     ]
    }
   ],
   "source": [
    "lgb_params = {\n",
    "      \"objective\": \"binary\", \n",
    "      \"metric\": \"binary_error\", \n",
    "      \"boosting_type\": \"gbdt\",\n",
    "      'early_stopping_rounds': 100,\n",
    "      'learning_rate': 0.01,  \n",
    "      'colsample_bytree':0.95, \n",
    "} \n",
    "\n",
    "X_tr_val = df_tr[tr_features + [label_gender]]\n",
    "X_te     = df_te[tr_features]\n",
    " \n",
    "kf = KFold(n_splits=5)\n",
    "lgb_gender_models = []\n",
    "y_pred = 0\n",
    "for f,(tr_ind,val_ind) in enumerate(kf.split(X_tr_val)):\n",
    "    \n",
    "    X_train,X_valid = X_tr_val.iloc[tr_ind][tr_features], X_tr_val.iloc[val_ind][tr_features]\n",
    "    y_train,y_valid = X_tr_val.iloc[tr_ind][label_gender], X_tr_val.iloc[val_ind][label_gender]\n",
    "    \n",
    "    lgbm_train = lgbm.Dataset(X_train,y_train)  \n",
    "    lgbm_valid = lgbm.Dataset(X_valid,y_valid)\n",
    "\n",
    "    model_binary = lgbm.train(params=lgb_params, \n",
    "                  train_set=lgbm_train,\n",
    "                  valid_sets=[lgbm_train, lgbm_valid],\n",
    "                  num_boost_round=100000,   \n",
    "                  verbose_eval=100)\n",
    "    y_pred += model_binary.predict(X_te[tr_features]) / 5.0\n",
    "    lgb_gender_models.append(model_binary) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:19:26.161220Z",
     "start_time": "2021-08-15T09:19:26.156042Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df_submit = df_te[['device_id']].copy()\n",
    "df_submit['gender'] = (y_pred >=0.5) + 0\n",
    "df_submit['gender'] = df_submit['gender'].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### age模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:19:39.234554Z",
     "start_time": "2021-08-15T09:19:28.043121Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002004 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3032\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] Start training from score 30.000000\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's l1: 6.78229\ttraining's Age Error: 0.128497\tvalid_1's l1: 8.09957\tvalid_1's Age Error: 0.109895\n",
      "[200]\ttraining's l1: 6.60257\ttraining's Age Error: 0.131534\tvalid_1's l1: 8.01598\tvalid_1's Age Error: 0.110914\n",
      "[300]\ttraining's l1: 6.47553\ttraining's Age Error: 0.13377\tvalid_1's l1: 7.9676\tvalid_1's Age Error: 0.111513\n",
      "[400]\ttraining's l1: 6.38702\ttraining's Age Error: 0.135373\tvalid_1's l1: 7.96073\tvalid_1's Age Error: 0.111598\n",
      "Early stopping, best iteration is:\n",
      "[375]\ttraining's l1: 6.40741\ttraining's Age Error: 0.135\tvalid_1's l1: 7.95786\tvalid_1's Age Error: 0.111634\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000924 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3030\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] Start training from score 30.000000\n",
      "Training until validation scores don't improve for 100 rounds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/wangrong/opt/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument\n",
      "  _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[100]\ttraining's l1: 6.89408\ttraining's Age Error: 0.126677\tvalid_1's l1: 7.58618\tvalid_1's Age Error: 0.116466\n",
      "[200]\ttraining's l1: 6.70926\ttraining's Age Error: 0.129714\tvalid_1's l1: 7.47627\tvalid_1's Age Error: 0.117976\n",
      "[300]\ttraining's l1: 6.58554\ttraining's Age Error: 0.13183\tvalid_1's l1: 7.42871\tvalid_1's Age Error: 0.118642\n",
      "[400]\ttraining's l1: 6.48646\ttraining's Age Error: 0.133575\tvalid_1's l1: 7.405\tvalid_1's Age Error: 0.118977\n",
      "[500]\ttraining's l1: 6.40254\ttraining's Age Error: 0.135089\tvalid_1's l1: 7.39734\tvalid_1's Age Error: 0.119085\n",
      "[600]\ttraining's l1: 6.33728\ttraining's Age Error: 0.13629\tvalid_1's l1: 7.39445\tvalid_1's Age Error: 0.119126\n",
      "[700]\ttraining's l1: 6.27831\ttraining's Age Error: 0.137395\tvalid_1's l1: 7.39564\tvalid_1's Age Error: 0.119109\n",
      "Early stopping, best iteration is:\n",
      "[632]\ttraining's l1: 6.31788\ttraining's Age Error: 0.136652\tvalid_1's l1: 7.39393\tvalid_1's Age Error: 0.119134\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000996 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3037\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/wangrong/opt/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument\n",
      "  _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[LightGBM] [Info] Start training from score 30.000000\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's l1: 7.11433\ttraining's Age Error: 0.123239\tvalid_1's l1: 6.59517\tvalid_1's Age Error: 0.131663\n",
      "[200]\ttraining's l1: 6.91579\ttraining's Age Error: 0.12633\tvalid_1's l1: 6.54172\tvalid_1's Age Error: 0.132596\n",
      "[300]\ttraining's l1: 6.77959\ttraining's Age Error: 0.128542\tvalid_1's l1: 6.53539\tvalid_1's Age Error: 0.132707\n",
      "[400]\ttraining's l1: 6.68062\ttraining's Age Error: 0.130198\tvalid_1's l1: 6.52882\tvalid_1's Age Error: 0.132823\n",
      "[500]\ttraining's l1: 6.59194\ttraining's Age Error: 0.131719\tvalid_1's l1: 6.52555\tvalid_1's Age Error: 0.132881\n",
      "Early stopping, best iteration is:\n",
      "[493]\ttraining's l1: 6.59739\ttraining's Age Error: 0.131624\tvalid_1's l1: 6.52537\tvalid_1's Age Error: 0.132884\n",
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001271 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3036\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] Start training from score 30.000000\n",
      "Training until validation scores don't improve for 100 rounds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/wangrong/opt/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument\n",
      "  _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[100]\ttraining's l1: 7.10475\ttraining's Age Error: 0.123384\tvalid_1's l1: 6.68054\tvalid_1's Age Error: 0.130199\n",
      "[200]\ttraining's l1: 6.91294\ttraining's Age Error: 0.126375\tvalid_1's l1: 6.58465\tvalid_1's Age Error: 0.131845\n",
      "[300]\ttraining's l1: 6.77547\ttraining's Age Error: 0.12861\tvalid_1's l1: 6.55076\tvalid_1's Age Error: 0.132437\n",
      "[400]\ttraining's l1: 6.67143\ttraining's Age Error: 0.130354\tvalid_1's l1: 6.53571\tvalid_1's Age Error: 0.132702\n",
      "[500]\ttraining's l1: 6.58621\ttraining's Age Error: 0.131818\tvalid_1's l1: 6.52837\tvalid_1's Age Error: 0.132831\n",
      "[600]\ttraining's l1: 6.51966\ttraining's Age Error: 0.132985\tvalid_1's l1: 6.52825\tvalid_1's Age Error: 0.132833\n",
      "Early stopping, best iteration is:\n",
      "[545]\ttraining's l1: 6.55478\ttraining's Age Error: 0.132366\tvalid_1's l1: 6.52719\tvalid_1's Age Error: 0.132852\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/wangrong/opt/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py:153: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument\n",
      "  _log_warning(\"Found `{}` in params. Will use it instead of argument\".format(alias))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001132 seconds.\n",
      "You can set `force_col_wise=true` to remove the overhead.\n",
      "[LightGBM] [Info] Total Bins 3038\n",
      "[LightGBM] [Info] Number of data points in the train set: 16060, number of used features: 15\n",
      "[LightGBM] [Info] Start training from score 30.000000\n",
      "Training until validation scores don't improve for 100 rounds\n",
      "[100]\ttraining's l1: 7.07656\ttraining's Age Error: 0.123815\tvalid_1's l1: 6.79393\tvalid_1's Age Error: 0.128305\n",
      "[200]\ttraining's l1: 6.8788\ttraining's Age Error: 0.126923\tvalid_1's l1: 6.71435\tvalid_1's Age Error: 0.129629\n",
      "[300]\ttraining's l1: 6.74541\ttraining's Age Error: 0.129109\tvalid_1's l1: 6.68214\tvalid_1's Age Error: 0.130172\n",
      "[400]\ttraining's l1: 6.63994\ttraining's Age Error: 0.130891\tvalid_1's l1: 6.66929\tvalid_1's Age Error: 0.13039\n",
      "[500]\ttraining's l1: 6.55454\ttraining's Age Error: 0.132371\tvalid_1's l1: 6.67073\tvalid_1's Age Error: 0.130366\n",
      "Early stopping, best iteration is:\n",
      "[480]\ttraining's l1: 6.56954\ttraining's Age Error: 0.132108\tvalid_1's l1: 6.66866\tvalid_1's Age Error: 0.130401\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "\n",
    "def feval_lgb_Age(preds, lgbm_train):\n",
    "    # Custom eval metric matching the competition's age score: 1 / (1 + MAE).\n",
    "    # Despite the name 'Age Error', this is a score; the trailing True marks it as higher-is-better.\n",
    "    labels = lgbm_train.get_label()\n",
    "    return 'Age Error', round(1.0 / (1.0 + mean_absolute_error(y_true=labels, y_pred=preds)), 7), True\n",
    "\n",
    "\n",
    "lgb_params = {\n",
    "    'objective': 'mae',  # optimize L1 loss directly, in line with the MAE-based score\n",
    "    'boosting_type': 'gbdt',\n",
    "    'early_stopping_rounds': 100,\n",
    "    'learning_rate': 0.01,\n",
    "    'colsample_bytree': 0.95,\n",
    "}\n",
    "\n",
    "X_tr_val = df_tr[tr_features + [label_age]]\n",
    "X_te = df_te[tr_features]\n",
    "\n",
    "n_splits = 5\n",
    "kf = KFold(n_splits=n_splits)\n",
    "lgb_age_models = []\n",
    "y_pred = 0\n",
    "for f, (tr_ind, val_ind) in enumerate(kf.split(X_tr_val)):\n",
    "    X_train, X_valid = X_tr_val.iloc[tr_ind][tr_features], X_tr_val.iloc[val_ind][tr_features]\n",
    "    y_train, y_valid = X_tr_val.iloc[tr_ind][label_age], X_tr_val.iloc[val_ind][label_age]\n",
    "\n",
    "    lgbm_train = lgbm.Dataset(X_train, y_train)\n",
    "    lgbm_valid = lgbm.Dataset(X_valid, y_valid)\n",
    "\n",
    "    # num_boost_round is only an upper bound; early stopping decides the actual number of trees.\n",
    "    model_mae = lgbm.train(params=lgb_params,\n",
    "                           train_set=lgbm_train,\n",
    "                           valid_sets=[lgbm_train, lgbm_valid],\n",
    "                           num_boost_round=100000,\n",
    "                           feval=feval_lgb_Age,\n",
    "                           verbose_eval=100)\n",
    "    # Average the test-set predictions over the folds.\n",
    "    y_pred += model_mae.predict(X_te[tr_features]) / n_splits\n",
    "    lgb_age_models.append(model_mae)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:19:39.241227Z",
     "start_time": "2021-08-15T09:19:39.237230Z"
    }
   },
   "outputs": [],
   "source": [
    "# Store the fold-averaged age predictions in the submission frame.\n",
    "df_submit['age'] = y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-08-15T09:19:39.889227Z",
     "start_time": "2021-08-15T09:19:39.853504Z"
    }
   },
   "outputs": [],
   "source": [
    "# Write the submission file without the index column.\n",
    "df_submit.to_csv('baseline_Fold5_lgb.csv', index=False)"
   ]
  },
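  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "In this notebook we ran a simple analysis of the data and built a simple baseline. The model currently relies only on basic statistical features, so there is plenty of room for improvement, for example:\n",
    "\n",
    "- Strengthen feature engineering: for example, make fuller use of the tag list data, and design sequence-based features to better characterize users.\n",
    "- This is a multi-objective optimization problem. Earlier WeChat and Tencent competitions are a useful reference, and recent techniques such as MMoE can be used for joint optimization.\n",
    "- Blend neural networks with tree models."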
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
